
Conversation

@a-r-r-o-w
Contributor

Fixes #11307

The previous implementation assumed that layers were instantiated in the same order as they are invoked in the forward pass. This is not true for HiDream (the caption projection layers are instantiated after the transformer layers).

The new implementation first captures the invocation order and then applies group offloading. When `use_stream=True`, it does not really make sense to onload more than one block at a time, so we also now raise an error if `num_blocks_per_group != 1` when `use_stream=True`.
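
A minimal sketch of the idea (hypothetical helper; not the actual diffusers implementation): record the order in which the top-level blocks are invoked during a forward pass, and only then form the offload groups from that order instead of the instantiation order.

```python
import torch
import torch.nn as nn


def record_invocation_order(model: nn.Module, example_inputs: dict) -> list[str]:
    """Hypothetical helper: run one forward pass with pre-forward hooks on the
    immediate children of `model` and record the order in which they are called.
    Only a sketch of the idea; the actual diffusers implementation differs."""
    order: list[str] = []
    handles = []

    def make_hook(name):
        def hook(module, args, kwargs):
            if name not in order:
                order.append(name)
        return hook

    for name, child in model.named_children():
        handles.append(child.register_forward_pre_hook(make_hook(name), with_kwargs=True))

    with torch.no_grad():
        model(**example_inputs)

    for handle in handles:
        handle.remove()

    # Offload groups can then be formed by chunking `order` into consecutive groups of
    # num_blocks_per_group entries, instead of relying on instantiation order
    # (which does not match invocation order for HiDream).
    return order
```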

Another possible fix is to simply move the initialization of the caption layers above the transformer blocks.

@sayakpaul @asomoza Could you verify if this fixes it for you?

@a-r-r-o-w a-r-r-o-w requested review from DN6, asomoza and sayakpaul April 21, 2025 10:39
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@sayakpaul sayakpaul left a comment


LGTM! Thank you.

@sayakpaul
Member

sayakpaul commented Apr 21, 2025

I did some testing and we get the following numbers:

No record_stream
=== System Memory Stats (Before encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1942.53 GB

=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (After encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1932.83 GB

=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB

=== System Memory Stats (Before transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1917.84 GB

=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB

=== System Memory Stats (After loading transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1880.56 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:30<00:00,  4.20s/it]
latents.shape=torch.Size([1, 16, 128, 128])

=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 5.68 GB
Max reserved: 5.68 GB

record_stream
=== System Memory Stats (start) ===
Total system memory:    1999.99 GB
Available system memory:1941.94 GB

=== CUDA Memory Stats start ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (Before encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1940.32 GB

=== CUDA Memory Stats Before encode prompt ===
Current allocated: 0.00 GB
Max allocated: 0.00 GB
Current reserved: 0.00 GB
Max reserved: 0.00 GB

=== System Memory Stats (After encode prompt) ===
Total system memory:    1999.99 GB
Available system memory:1930.62 GB

=== CUDA Memory Stats After encode prompt ===
Current allocated: 15.05 GB
Max allocated: 15.05 GB
Current reserved: 15.29 GB
Max reserved: 15.29 GB

=== System Memory Stats (Before transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1915.65 GB

=== CUDA Memory Stats Before transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 0.10 GB
Max reserved: 0.10 GB

=== System Memory Stats (After loading transformer.) ===
Total system memory:    1999.99 GB
Available system memory:1883.74 GB
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [03:14<00:00,  3.89s/it]
latents.shape=torch.Size([1, 16, 128, 128])

=== CUDA Memory Stats After inference with transformer. ===
Current allocated: 0.10 GB
Max allocated: 0.10 GB
Current reserved: 4.30 GB
Max reserved: 4.30 GB
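
For reference, the `record_stream` run peaks at 4.30 GB reserved vs. 5.68 GB without it, and runs at 3.89 s/it vs. 4.20 s/it. A sketch of how such a comparison might be configured (the checkpoint id, model class, and keyword names below are assumptions, not the script used for these numbers; check the group offloading docs of the installed diffusers version):

```python
import torch
from diffusers import HiDreamImageTransformer2DModel
from diffusers.hooks import apply_group_offloading

# Sketch only: checkpoint id, subfolder, and argument names are assumptions.
transformer = HiDreamImageTransformer2DModel.from_pretrained(
    "HiDream-ai/HiDream-I1-Full", subfolder="transformer", torch_dtype=torch.bfloat16
)
apply_group_offloading(
    transformer,
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="block_level",
    num_blocks_per_group=1,   # must be 1 when use_stream=True after this PR
    use_stream=True,
    record_stream=True,       # set to False to reproduce the "No record_stream" run
)
```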

diffusers-cli env:

- 🤗 Diffusers version: 0.34.0.dev0
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Running on Google Colab?: No
- Python version: 3.10.14
- PyTorch version (GPU?): 2.8.0.dev20250417+cu126 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.30.2
- Transformers version: 4.52.0.dev0
- Accelerate version: 1.4.0.dev0
- PEFT version: 0.15.2.dev0
- Bitsandbytes version: 0.45.3
- Safetensors version: 0.4.5
- xFormers version: not installed
- Accelerator: NVIDIA H100 80GB HBM3, 81559 MiB
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

Member

@sayakpaul sayakpaul left a comment


Thanks for adding the test! Just two comments.

@a-r-r-o-w
Contributor Author

Failing tests seem unrelated

@a-r-r-o-w a-r-r-o-w merged commit 6cef71d into main Apr 23, 2025
15 of 16 checks passed
@a-r-r-o-w a-r-r-o-w deleted the fix-block-level-stream-offloading branch April 23, 2025 12:47
option only matters when using streamed CPU offloading (i.e. `use_stream=True`). This can be useful when
the CPU memory is a bottleneck but may counteract the benefits of using streams.
"""
if stream is not None and num_blocks_per_group != 1:
Collaborator


This is potentially breaking, no? What if there is existing code with `num_blocks_per_group > 1` and `stream=True`? If so, it might be better to raise a warning and set `num_blocks_per_group` to 1 if `stream` is True.

Contributor Author


Has been addressed in #11425
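
For context, a self-contained sketch of the warn-and-clamp behavior suggested above (illustrative only; the helper name is hypothetical and the actual change is in #11425):

```python
import warnings


def _resolve_num_blocks_per_group(num_blocks_per_group: int, use_stream: bool) -> int:
    # Hypothetical helper: instead of raising, warn and clamp num_blocks_per_group
    # to 1 when streams are used, so existing code keeps working.
    if use_stream and num_blocks_per_group != 1:
        warnings.warn(
            f"Using streams is only supported with num_blocks_per_group=1, but got "
            f"{num_blocks_per_group}. Setting num_blocks_per_group to 1."
        )
        return 1
    return num_blocks_per_group
```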

Development

Successfully merging this pull request may close: HiDream running into issues with group offloading at the block-level